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Abstract 

Background: Gene set analyses have become increasingly important in genomic research, as many complex 
diseases are contributed jointly by alterations of numerous genes. Genes often coordinate together as a functional 
repertoire, e.g., a biological pathway/network and are highly correlated. However, most of the existing gene set 
analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important 
feature of a gene set to improve statistical power in gene set analyses. 

Results: We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on 
multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation 
among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a 
Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression 
coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled 
chi-square approximation. We show using simulations that type I error is protected under different choices of working 
covariance matrices and power is improved as the working covariance approaches the true covariance. The global 
test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a 
published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and 
gene set enrichment analysis (GSEA). 

Conclusion: We develop a gene set analyses method (TEGS) under the multivariate regression framework, which 
directly models the interdependence of the expression values in a gene set using a working covariance. TEGS 
outperforms two widely used methods, GSEA and global test in both simulation and a diabetes microarray data. 



Background 

Genome-wide analysis using microarray data, includ- 
ing RNA expression, DNA copy number and epigenetic 
DNA methylation, has become a popular tool in genomic 
research. Single gene/marker analysis provides a quick 
and convenient tool to identify top genes that might be 
associated with phenotypic trait. However, it is subject to 
a large number of false positives due to a large number 
of comparisons, and does not fully take into account that 
some genes have similar biological functions and work 
together. 

Microarray gene expressions or genetic markers usually 
have natural groupings based on biological knowledge. 
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For example, multiple genes belong to the same biolog- 
ical pathway or network; or contiguous copy number- 
detecting probes belong to the same gene or cytoband. 
Incorporating the prior knowledge or annotation about 
the grouping underlying the genome-wide data can make 
the results more interpretable. Note that the grouping may 
not necessarily come from biology. It can also be a cluster 
of genes identified using clustering methods. In this paper, 
these natural or statistical groupings are loosely called a 
gene set, which refers to a set of genes, or a set of markers 
or simply a set of probes. 

Numerous approaches for gene set analyses have been 
proposed [1], including the overrepresentation analysis 
[2], the univariate tests [3], the multivariate tests [4,5], the 
global test [6], and gene set enrichment analysis (GSEA) 
[7,8] and its variant [9]. The overrepresentation-type anal- 
ysis has been found to suffer from methodological prob- 
lems, which may lead to confusing results [10]. The global 
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test and GSEA improve over the overrepresentation-type 
analysis. The global test regresses the phenotype on gene 
expressions in a gene set and tests for regression coeffi- 
cients. GSEA performs a modified Kolmogorov-Smirnov 
test by comparing a gene set with the rest of the genes 
in the genome. However, the test statistics used in both 
methods ignore the correlation among the genes in a 
gene set and hence are subject to loss of statistical power, 
as genes in a gene set are often correlated and function 
together. The univariate test does not account for the 
correlation and loses power when the interdependence 
within the gene set is high, compared with the multivariate 
tests [11]. 

We propose in this paper to test for the effect of a 
gene set using a variance component test in multivari- 
ate regression model, where the correlation among genes 
in a gene set is explicitly taken into account. We term 
this test TEGS (Test for the Effect of a Gene Set). Specif- 
ically, we regress the gene expressions in a gene set on 
an independent variable, such an exposure or biological 
state variable, e.g., smoking (yes/no) or lung cancer status 
(yes/no), using multivariate regression, where the correla- 
tion among genes in a gene set is modeled using a working 
covariance matrix. As the number of genes might be large 
in a gene set, we develop a variance component score 
test for testing the effects of the exposure/biological state 
on the overall gene set profile by assuming regression 
coefficients follow a common distribution. 

We show that TEGS includes the global test of Goeman, 
et al (2004) as a special case when correlation among the 
genes in a gene set is ignored. We conduct simulation 
studies to evaluate the finite sample performance of TEGS 
and compare it with the global test and GSEA. We apply 
the proposed method to analysis of the Type II Diabetes 
data set [7]. 

Methods 

The model 

Suppose that there are n subjects and subject i has p con- 
tinuous outcomes Ya, Y/2> • • • , Yi P . In gene set analysis, the 
p outcomes indicate the expression values of p genes in 
a gene set, and X{ is an independent variable, e.g., expo- 
sure/biological state variable, such as mutation status: 1 if 
mutant and 0 if wild-type; or disease status (yes/no) for 
subject /. We consider the multivariate linear model 

Yy = oij+Xify + €ij, i = l,2,...,n and ; = 1,2, (1) 

where the errors, €i = (en, 6/2, — > £i P ) T are assumed to be 
independent across different subjects and follow a multi- 
variate normal distribution with mean 0 and true covari- 
ance E, which is often unknown, and otj is the average 
expression value of gene j for those with x = 0. Covariates 
can be incorporated in the model (1) by expanding otj to 



be X!a:=i a jk z ik where K is the number of covariates plus 
one (i.e., the intercept), Zik is the covariate h of subject /, 
Za is 1, and is the regression coefficient of the covariate 
k for the gene However, because the data we are deal- 
ing with has small n and large p, we would need the ridge 
regression to estimate a ft. If x\ is binary, e.g., disease sta- 
tus, is the mean difference of the expression levels of 
gene j between the two disease groups. Model (1) can be 
written in matrix notation by stacking data of n subjects 
and p gene expressions as 

Y =Ja+Xp + e, (2) 

where Y=(Y\, • • • , Y^) T is an^xl vector, Y t = (Ya, 
Yi2,...,Y ip ) T , € = (€*,...,€T) T , J = (I P ,.-,I P ) T ,X = 
(xil p >- • -,x n I p ) T ,ot = (a h a 2 ,- • ;a p ) T ,P=(Ph p2>- • •, P P ) J 

Gene set analysis using TEGS: A variance component score 
test 

The null hypothesis H o : f$ = 0 indicates that x\ has 
no effect on the mean of gene expression profile Y{ in 
a gene set. A traditional multivariate test for Ho [4] is 
based on a p-degree of freedom test and hence has limited 
power when the size of the gene set p is large, especially 
in the presence of a large number of null genes. To over- 
come this problem and improve test power, we assume 
the regression coefficients follows an arbitrary common 
distribution with mean 0 and variance r. The model (2) 
becomes a linear mixed model [12]. The null hypothesis 
Ho : P = 0 is equivalent to the null hypothesis for the 
variance component Ho : r = 0. To test for Ho : r = 0, 
one can perform a variance component score test [13]. 

Specifically, following Lin (1997), simple calculations 
show that the score for the variance component r under 
the induced linear mixed model is 

(Y -Ja) T Y- l XX T Y-\Y -Ja) - tr (x^XX 7 ) , (3) 

where Y n = diagiY, • • • , X) is an np x np block diagonal 
matrix. As the second term does not depend on data, we 
use the first term to construct the test statistic 

Q T = {Y-Jci) T ^- l XX T ^-\Y-j6i\ (4) 

where a is the maximum likelihood estimator of a 
under Ho- One can easily show that under Ho, & = 
(J T Y- l jy l J T Y- l Y = Y, where Y = /T 1 £JLi 
is simply the sample mean. Hence Equation (4) can be 
written as 

Q T = Y T (I - H)Y~ l XX T ^- 1 (I - H)Y, 

where H = n~ l JJ T . As Qt is quadratic in Y. Some cal- 
culations show that Qt follows a mixture of chi- square 
distribution J2j fyXij* where the weights kj are the eigen- 
values of the matrixX r (/ - H)Y,- l (I - H)X. 
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The test statistic Qt depends on the true covariance 
matrix X of Yi, which is often unknown in practice and 
requires estimation of a large number of parameters. 
Although sample covariance can be used to estimate £, 
it is not stable when the size of the gene set p is large or 
moderate and sample size is small We hence propose the 
use of a working covariance V for e / in (1) [14], which 
has a simpler structure and depends on a small num- 
ber of parameters. We derive a variance component test 
for Ho : r = 0 assuming € / has a covariance V, which 
might misspecify the true covariance E. Under this work- 
ing model, similar calculations show that the variance 
component score statistic for Hq : r =0 is 



Q = Y T (I - H)V~ 1 XX T V- 1 (I - H) Y, 



Yi and is the optimal test statistic within the TEGS statis- 
tic family, but cannot be calculated in practice as the true 
covariance of Yi is unknown. 

Special case of two group comparison and relationship of 
TEGS with the global test 

Consider the setting of testing for the effect of a binary 
exposure/disease status on expressions in a gene set, i.e., 
Xi is binary (0/1), some calculations show that the TEGS 
statistic Q in (5) can be simplified as 



Q ={?(!) - Y{2)} T V- l V- l {Y(l) - Y(2)} 



(5) 



_ \ n\V2 
1 n 



E 

7=1 



^ k lYk(X)-Y k (2)] 
k=i 



(6) 



where V n = diag(V \ • • • t V). We term the variance 
component test using Q TEGS (Test for the Effect of a 
Gene Set). 

Examples of working covariance V include working 
independence (Indpt), which assumes the genes are inde- 
pendent in a gene set; factor analysis covariances assum- 
ing two factors (F-2); adaptive factor analysis covariance 
with the estimated number of factors explaining up to 80% 
variability (F-adpt), compound symmetry (CpSym), which 
assumes the same pair-wise correlation among genes; and 
unstructured sample covariance (Unstr). 

The unstructured sample covariance is estimated using 
the residuals 6// obtained by performing separate sim- 
ple linear regression of individual gene expressions 
on Xi in (1). When xi is binary, e.g., disease=yes/no, 
is simply the ;th centered outcome using the group spe- 
cific means. When the number of genes in a gene set is 
large and the sample size is small, the standard empiri- 
cal unstructured sample covariance estimator is unstable. 
We hence stabilize it using a ridge estimator by adding 
the 5th percentiles of sample variances to the diagonal 
of the empirical covariance estimator. Estimation for the 
compound symmetry covariance and the factor analysis 
covariance was based on the ridge unstructured covari- 
ance estimator. Specifically, for the compound symmetry 
covariance estimator, the pair-wise covariance is esti- 
mated as the sample mean of the off-diagonal elements 
of the ridge unstructured covariance estimator. The two- 
factor and adaptive-factor covariances are estimated by 
singular value decomposition of the ridge unstructured 
covariance estimator. 

We discuss in Section Null distribution of TEGS estima- 
tion of the p-value using the TEGS test statistic Q. We per- 
fromed simulation studies to investigate the performance 
of size and power using different working covariances in 
a wide range of scenarios and compare TEGS with that 
using Qt, which is based on the true covariance matrix of 



where Y(l) and Y(2) are the sample mean of the out- 
come vector for group 1 and 2, and Y#(l) and Yfr(2) (k = 
1, • • • ,p) are their components, \S k is the (/, /c)th element 
of V~ l . This suggests that the TEGS statistic Q compares 
the weighted average of the outcome-specific mean dif- 
ferences of the gene expression profiles between the two 
groups. 

If one assumes working independence V = /, the TEGS 
test statistic Q in (6) becomes 



Qind = [Y(l) ~ Y(2)} T {Y(1) - Y(2)} 
H k=l 



(7) 



It can be easily shown that the TEGS statistic that 
uses the working independence covariance among gene 
expressions in a gene set (Qi n d) is the same as the global 
test of Goeman, et al (2004). Although the global test is 
equivalent to the TEGS with working independence, it is 
still derived under the model (1) where the true covariance 
£ is not necessarily independent. 

Specifically, the global test is derived as the variance 
component test under the logistic regression model of the 
binary disease status Xi on the p gene expressions 



logitijti) = 7o + 7i Yn H h YpY ip . 



(8) 



where tt; = Pr(xi = l\Yn, • • • , Yi p ) is the probability of 
disease given the gene expression profiles in a gene set. 
Under the logistic model (8), to test for the null hypothe- 
sis of no gene set effect on disease status Ho : y\ = • • • = 
y p = 0, Goeman, et al (2004) assumed the coefficients 
Yj are independent and follow an arbitrary distribution 
with mean 0 and variance r. The logistic model (8) hence 
becomes a logistic mixed model [15]. It follows that the 
null hypothesis Hq : 71 = • • • = Yp = 0 is identical to 
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Ho : r = 0. Goeman, et al (2004) derived the variance 
component score test for Ho : r = 0 and termed it as the 
global test. One can easily show that the global test is iden- 
tical to Qi n d in (7), apart from a term that does not depend 
on Y. 

A comparison of (6) and (7) shows that TEGS has the 
flexibility to account for different correlations among gene 
expressions in a gene set by comparing the weighted 
differences of the means of gene expressions between 
two groups, while the global test, which is the same as 
the TEGS assuming working independence among gene 
expressions, ignores correlation among gene expressions. 
One hence would expect that TEGS that accounts for 
within gene set correlation is likely to be more powerful 
than the global test. 

Another testing procedure that is closely related to 
TEGS is the Sequence Kernel Association Test (SKAT), 
a method developed to analyze SNP (single nucleotide 
polymorphism) or sequence data in genome-wide associ- 
ation studies [16]. It has been shown that the global test is 
equivalent to the SKAT with linear kernel [16,17]. Thus, 
the TEGS with working independence is equivalent to 
the SKAT with linear kernel. However, TEGS with other 
working correlations and SKAT with other kernels do not 
have an obvious correspondence. 

Null distribution of TEGS 

As the TEGS statistic Q in (5) is a quadratic function 
of Y, we have shown that it follows a mixture of chi- 
square distributions, where the weights depend on the 
true covariance £ and the working covariance V . We 
propose two methods to estimate the p-value of TEGS. 

Permutation 

One approach to calculate the p-value for the TEGS statis- 
tic Q, is based on permutation, where we permute the Xi's, 
and calculate Q for each permuted dataset and compare 
the observed value of Q with those calculated based on 
the permuted samples. Note that for each permutation, V 
need to be re-estimated given an assumed structure, e.g., 
under independence, unstructured, exchangeable, as v is 
the covariance conditional on the x. If the sample size is 
large (i.e. >100), one may use the Monte Carlo approach 
proposed by Lin [18]. 

Scaled x 2 approximation 

The second approach is to compute the p-value for the 
TEGS statistic Q is to use the Satterthwaite method [19] to 
approximate the null distribution of Q, which is a mixture 
of x 2 distributions. The Satterthwaite method approxi- 
mates the null distribution of Q by a scaled x 2 distribution 
KX 2 > where k is the scale parameter and v is the degree of 
freedom. The values of k and v can be estimated by match- 



ing the first two moments of Q under Ho with those of the 
the scaled x 2 distribution as 

Var(Q) 2[E(Q)] 2 
2E(Q) ,V Var(Q) * 

We estimate the mean and variance of Q under the null 
using permutation and denote the p-value estimated using 
this approach as p KX 2. Using the Satterthwaite approxima- 
tion, we are able to calculate small p-values based on a 
smaller number of permutations than the first method. 

Normal mixture approximation 

In order to achieve better precision of smaller p-values, 
we further propose a method using the normal mixture 
approximation [20]. Specifically, we fit a two-population 
normal mixture 7TiA/X/xi, cr-j 2 ) + ^A/X/^, o^) for the 
<b~ l (p {yb \) where p (yb \ is the scaled x 2 approximated p- 

value for the statistic obtained at permutation b, 
b = 1, B (B is the number of permutation), O is the 
cumulative distribution function of the standard normal, 
and Tt a , /Ji a and a% are proportion, mean and variance of 
the normal distribution a [a = 1, 2), respectively, p-value 
can then be estimated as the tail probability by comparing 
®~ l (P KX ^ and niN(fri,&i) + 7r 2 A/X/X2, <t 2 2 ) where jX a oj 
and 7t a , respectively are maximum likelihood estimates of 
li a , cr 2 and 7i a . 

Power calculations 

To design a new study using a gene set analysis, one 
needs to perform power calculations. We discuss in this 
section power calculations using TEGS. The distribution 
of Q under the alternative hypothesis follows a mixture 
of non-central chi-square distributions. We approximate 
this distribution using a scaled non-central chi-square 
distribution kx 2 (<$). Specifically, we first estimate k and 
v under Ho as k = Var// 0 (Q)/[2E// 0 (Q)] and v = 
2[E //o (Q)] 2 /Var //o (Q), where E Ho (Q) and Var^CQ) are 
the mean and the variance estimated theoretically as 
under the null as 

E Ho (Q) = tr ((/ - H)V 1 XX T V 1 (I - if)z) 

Var Ho (Q) = 2tr ((/ - H)V 1 XX T V 1 (I - H)Y 

(I - H)V 1 XX T V 1 (I-H)T,y 

For power calculations, to estimate the non-centrality 
parameter <5, the theoretical mean Eh a (Q) under the alter- 
native is 

E Ha (Q) = tr ((/ - H)V 1 XX T V 1 (I - 

+ p T X T (I - H)V~ 1 XX T V~ 1 (I - H)Xp. 

Setting Eh a (Q) = (v + 8)k, which is the mean of 
K X 2 ° ne can solve for 5, and calculate the power of 
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the test using Pr(/y (8) > Xv or) wnere a is the size of the 
test. The true covariance X and the working covariance 
V can be estimated using the pilot data. One can perform 
calculations by varying and the effects 

Simulation study 
Single gene set 

We simulated the data using model (1). Two different 
combinations of n and p were considered: h=50 and p=lO 
and h=20 and j?=40. Four different true covariances of 
€i, £, were investigated: (1) compound symmetry (CS), 
where we assumed the diagonal elements equal to 1 
and the off-diagonal elements equal to 0.1 or 0.5; (2) 
first-order autoregressive (AR1), where we assumed the 
diagonal elements equal to 1 and off-diagonal elements 
decay by a factor of 0.5 or 0.8; (3) two factor covari- 
ance (F2): X = PiPf + P 2 Pj + diag{u}, where the p 
elements of the two factors, Pi and P2 were generated 
independently from two Gaussian distributions, and u 
was chosen to make the diagonal elements of the £ 
equal to Is; (4) the unstructured covariance (UNS), 
which was assumed to be the sample covariance of 
MAP00240_Pyrimidine_metabolism (/?=40) using the 
Type II Diabetes data in Mootha et al (2003). The sample 
covariance of MAP00240_Pyrimidine_metabolism was 
calculated based on 17 subjects with normal glucose 
tolerance and 17 Type II Diabetes patients by conditioning 
on the disease status. To avoid singularity of the sample 
covariance, the 5th percentile of the diagonal elements 
was added to the diagonal to construct the true covariance 
matrix used in simulations. 

The regression coefficients P was set by varying the pro- 
portion of non-zero /3 s and their magnitudes. For n=50 
and p=10, 0%, 40% and 80% of fi's were set to non-zero. 
The non-zero /3s were set to be /3=±0.25 or ±0.5. For 
n = 20 and /?=40, 0%, 25%, 50% and 60% of ps were 
set to be non-zero. The non-zero /3s were set to be ±0.5 
or ±1.0. The numbers of (-0.5,0.5) (or (-1.0,1.0)) are 
(2, 8), (5, 5), (5, 15), (10, 10), (5, 25), (10, 20) and (15, 15). 
The effect size is summarized by an index, Y?j=\ fijl® 2 
where a 2 is the average variance of the p gene expression 
in the same gene set, in the power plots given in Figures 1, 
2, 3 and 4. 

For each simulation and each true covariance config- 
uration, we compared the performance of TEGS assum- 
ing six different covariance matrices: true covariance, 
unstructured covariance, independence, two factor anal- 
ysis covariance, adaptive factor analysis covariance, and 
compound symmetry. Note that the TEGS assuming 
working independence corresponds to the global test 
(Goeman et al. 2004). The p- values were calculated as the 
tail probability of Q under the null distribution. The null 
distribution was approximated by the methods described 



in Section Null distribution of TEGS. A total of 1000 per- 
mutations were performed to nonparametrically approx- 
imate the null distribution of Q. A total of 5000 and 
1000 simulations, respectively were run for the setting 
under the null hypothesis (i.e., fi=0) and the alternative 
hypothesis to calculate sizes and powers. Type I error was 
calculated at the a =0.05 level. Statistical power was calcu- 
lated as the percentage of p-values less than 0.05 among 
1000 simulations. 

Multiple gene sets 

Gene set enrichment analysis (GSEA) is a widely used 
approach to study the enrichment of a gene set in a 
large number of genes, which often consists of multiple 
gene sets. The null hypothesis hence corresponds to the 
competitive null hypothesis [10]. To compare the perfor- 
mance of our proposed method TEGS with GSEA, we set 
up a simulation study involving multiple gene sets. The 
configuration is as follows: 

• Setting 1 : We set n =20 and the number of gene sets 
be 20. Ten gene sets have p=10 genes (gene sets #1- 
10). Among them, six gene sets are under the null and 
four gene sets are under the alternative. The other 
ten gene sets have p=40 genes per gene set (gene sets 
#11-20). Among them, six gene sets are under the 
null and four gene sets are under the alternative. 
Among the gene sets under the alternatives, we 
allowed some genes to have no effects (i.e., those with 
Pj = 0), and varied the number of signal genes (i.e., 
those with 7^ 0). The number and magnitude of 
non-zero /Ts or each of the gene sets under the 
alternative hypothesis are given in the top of Table 1. 
This setting has a total of 500 genes, with the total 
number of signal genes equal to 104 spreading across 
8 gene sets and the total number of null genes equal 
to 396. We assumed in this setting the 20 gene sets 
were uncorrelated. Within each gene set, we assumed 
the genes were correlated with the two factor 
covariance matrix: v* = PiPf + P2PJ + diag{u}. 

• Setting 2: This setting is identical to Setting 1 except 
that we allowed correlation among the gene sets: gene 
sets #1-3, #4-6, #7-9, #11-13, #14-16 and #17-19 are 
correlated. The correlation structures are estimated 
by two factor covariance from the sample covariances 
of the gene sets with p = 30 and 120 in the diabetes 
dataset. We marked correlated gene sets in 

Table 1. 

• Setting 3: This setting is identical to Setting 2 except 
that we added additional 4500 null genes to the 500 
genes in the 20 gene sets. This setting mimics more 
practical gene expression studies. This gives a total of 
5000 genes, with 104 signal genes spreading across 8 
gene sets and 4896 null genes. Among the 20 gene 
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Figure 1 Power comparison using simulations with the true covariance to be compound symmetry. We set n=20 and p=40; covariance 
(cov)=0.1 (top panel) and 0.5 (bottom panel). For each panel, the left plot is based on permutation and the right plot is based on the Satterthwaite 
scaled chi-square approximation. For each plot, the powers of TEGS assuming six different covariances were compared. TEGS assuming working 
independence (Indpt) corresponds to the global test of Goeman et al. (2004). The horizontal dotdash line indicates the size of the test (i.e., 5%). 



sets, same as before, there are 8 gene sets under the 
alternative and 12 null gene sets. 

For each setting, we applied TEGS and GSEA to each of 
the 20 gene sets to compare size and power. 

Application: reanalysis of Type II Diabetes data 

We applied the proposed method to analyze the Type II 
Diabetes gene expression data, which were previously ana- 
lyzed by Mootha et al. (2003) using GSEA to study for 
the pathway effects. The original data have three patient 
groups: normal glucose tolerance, impaired glucose toler- 
ance, and Type II Diabetes. To illustrate our method and 
compare it with GSEA, we restricted our analysis to two 
groups: 17 patients with normal glucose tolerance and 17 
patients with Type II Diabetes. A total of 124 out of 149 
gene sets were analyzed here. We excluded 25 small gene 
sets, which have less than four probes. 

We performed TEGS assuming five different work- 
ing covariances, including independence, unstructured 



covariance, factor analysis covariance using two fac- 
tors and the number of factors that explain up to 80% 
variability, and compound symmetry covariance. We cal- 
culated the p-values using permutation and the Satterth- 
waite method described in Section Null distribution of 
TEGS. The number of permutations for each gene set was 
2000. The working independence TEGS corresponds to 
the global test [4]. We compared the performance of TEGS 
with GSEA. The q value, an index measuring the false dis- 
covery rate (FDR) [21,22], was used to adjust for multiple 
comparisons. 

Results 

Simulation study 
Single gene set 

Four true covariances were considered in the simulations: 
compound symmetry, AR1, two factor, and unstructured 
covariance. The results are presented in Figures 1, 2, 3 
and 4. For each true covariance, we compared the powers 
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Figure 2 Power comparison using simulations with the true covariance to be AR1 . R is the value of the lag 1 autocorrelation. We set n=20 and 
p=40; covariance (cov)=0.1 (top panel) and 0.5 (bottom panel). Other notations and symbols are similar to Figure 1 . 



of TEGS assuming the true covariance and five different 
working covariances. 

Type I error rate is well protected at the size of 5% 
in all the settings with different approximation methods. 
For lower levels of the size (0.5% and 0.05%), different 



approximation methods perform well when using the true 
covariance (see Additional file 1: Table SI). For different 
working correlations, the permutation method protects 
the type I error at 0.5% and 0.05% where the type I error 
rate using the Satterthwaite approximation is inflated at 
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Figure 3 Power comparison using simulations with the true covariance to be two factor covariance. We set n=20 and p=40. Other notations 


and symbols are similar to Figure 1 . 
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Figure 4 Power comparison using simulations with the true covariance to be the stabilized sample covariance of 
MAP00240_Pyrimidine_metabolism. We set n=20 and p=40. Other notations and symbols are similar to Figure 1 . 



0.5% and 0.05% and that using the normal mixture approx- 
imation is well-protected at 0.5% but inflated at 0.05%. 

In all settings, TEGS assuming the true covariance 
(the black solid line) has the best power, while 
TEGS assuming independence among the genes 
(the black dash line), has the lowest power. TEGS calcu- 
lated by accounting for within-gene set correlation using 
an estimated working covariance is less powerful than that 
assuming the true covariance, but more powerful than 
TEGS assuming independence among the genes. As TEGS 
assuming independence among the genes is the same as 
the global test the results indicate that TEGS accounting 
for correlation among genes in a gene set is more powerful 
than the global test. A comparison of the top panel and the 
bottom panel in Figures 1 and 2 shows that the power gain 
of TEGS accounting for correlation among genes over the 
global test is more pronounced when the correlation is 
stronger. 

When the working covariance structure is correctly 
specified, TEGS using the estimated covariance has the 
power closest to that using the true covariance. For exam- 
ple, when the true covariance is compound symmetry, 
TEGS assuming the compound symmetry structure with 
the constant pair- wise covariance estimated from the data 
has the power curve closest to that assuming the true 
compound symmetry covariance with pair-wise covari- 
ance equal to 0.1 or 0.5. 

When the true covariance is the sample covariance of 
MAP00240_Pyrimidine_metabolism estimated from the 
diabetes data (Figure 4), TEGS obtained by estimating 
the covariance matrix using any of the working covari- 
ance matrices gives similar results, all outperforming 
TEGS assuming working independence among the genes 
(i.e., the global test). In all settings, TEGS assuming two 
factor analysis (F-2) and adaptive factor analysis (F-adpt) 



have most robust performance, and give powers that are 
closest to TEGS assuming the true covariance structure. 
The simulation results for the setting with n=50 and p-Vd 
show similar patterns to those with #=20 and /?=40, and 
are provided in the Additional file 1. 

Multiple gene sets 

Table 1 compares the performance of TEGS and GSEA in 
settings 2 and 3. The results in Table 1 show that the num- 
ber of signal genes not in the gene set affects the type I 
error rate and power of GSEA, but does not affect TEGS. 
For example, when the total number of genes is 500, the 
size of GSEA for testing a null gene set is somewhat con- 
servative, and less than the nominal size 0.05. As the total 
number of genes increased to 5000 with much more null 
genes, the size of GSEA for testing a null gene set became 
closer to 0.05. 

A comparison of the powers of the 8 gene sets under the 
alternative show that TEGS has better power than GSEA. 
When the number of genes increases from 500 to 5000 
with the number of signal genes remaining the same, i.e., 
increasing the number of null genes, the power of GSEA 
for testing the effect of a gene set does not improve much. 
The reason can be explained as below. GSEA tests for 
competitive null hypothesis. For a given gene set, say gene 
set 1, when the 4500 null genes are added to the set of 
500 genes, the proportion of signal genes in gene set 1 
remains the same (4/10=40%), while the proportion of sig- 
nal genes not in the gene set decreases from 100/490=20% 
to 100/4990=2%. Although the difference of the propor- 
tions of signal genes in gene set 1 and not in gene set 
1 becomes bigger, as the size of gene set 1 remains the 
same as 10, the p-value using the Kolmogorov-Smirnov 
test does not change much. Note that as TEGS tests for 
a self-contained null hypothesis [10], its power remains 



Table 1 The simulation results comparing size and power of TEGS and GSEA 
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The entries are the percentage of times the p-values are less than 0.05. The columns with non-zero fi's represent the setting under the alternative and those with zero fi's represent the settings under the null. Abbreviations 
of the working covariance are the same as those in Figure I.Two settings are considered: the first has 500 genes with 20 gene sets (8 gene sets under the alternative and 12 null gene sets of sizes 10 and 40); the second 
setting has 5000 genes with 20 gene sets (8 gene sets under the alternative and 12 null gene sets of sizes 10 and 40) and 4500 additional null genes. The power of TEGS for a given gene set is the same for the two settings. 
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the same as the total number of genes increases from 500 
to 5000. 

The power of TEGS increases quickly with the average 
effect size of a gene set, Y?j=i fy/^ 2 * while the power using 
GSEA improves slightly. This is because GSEA assesses 
whether the genes in a gene set are enriched towards the 
top in a list of all the genes, where individual genes are 
ranked by their p- values. Hence the difference between 
the proportion of signal genes in a given gene set and the 
proportion of signal genes not in the gene set affect the 
p-values calculated using GSEA, while the magnitudes of 
the signal genes have limited impact. 

A comparison of the results of gene sets of size 10 with 
those of size 40 in Table 1 shows that the size of a gene 
set, p, affects TEGS and GSEA differently. A smaller gene 
set, e.g.,/? =10, is less likely to be identified as significant 
using GSEA. However, the effect of the size of a gene set 
on TEGS is smaller. We report in Additional file 1: Table 
S2. The results when gene sets are independent. 

To run one simulation data generated in the setting 
2 (500 genes) on a desktop with 2.53 GHz CPU, the 
computation time of TEGS and GSEA (both with 200 per- 
mutations) is about 5.6 and 4 seconds, respectively. For 
the setting 3 (5000 genes), the computation time of TEGS 
and GSEA is about 60 and 14 seconds, respectively. The 
most computation burden in TEGS is to invert the work- 
ing covariance in each permutation {V~ l in (5)). Thus, 
the analysis with larger gene sets can cost much more 
computation. 

Application: re-analysis of Type II Diabetes data 

TEGS assuming independence among the genes identi- 
fied 5 and 8 gene sets with p values less than 0.05 using 
permutation and Satterthwaite methods, respectively. The 
corresponding numbers of gene sets were 13 and 14; 
15 and 14; and 9 and 10 using TEGS by estimating the 
covariance assuming the two factor analysis covariance, 
the adaptive factor analysis covariance, and the unstruc- 
tured sample covariance. GSEA identified 4 differentially 
expressed gene sets. The over-lapping numbers of dif- 
ferentially expressed gene sets between TEGS using the 
four working covariances and GSEA are shown in Table 2. 
TEGS assuming adaptive factor analysis covariance iden- 
tified 10 gene sets with FDR less than 0.1 and 20 gene sets 
with FDR less than 0.15. 

The gene set MAP00252_Alanine_and_aspartate_ 
metabolism was identified as differentially expressed 
between Type II Diabetes patients and those with normal 
glucose tolerance: p-value=0.006 using TEGS assuming 
adaptive-factor covariance, p=0.003 using TEGS with 
exchangeable covariance, p-value=0.005 with TEGS with 
independence covariance and 0.054 using TEGS with 
unstructured covariance; p=0.028 using GSEA. Figure 5a 



Table 2 Results of re-analyses of 1 24 gene sets in Type II 
Diabetes data 
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Each cell indicates the over-lapping number of differentially expressed gene sets 
at the nominal p-value=0.05 level using TEGS with different working covariance 
and GSEA shown in the corresponding column and row. 



shows that five genes in this gene set were differen- 
tially expressed based on single-gene analysis with the 
£-test. The heatmap in Figure 5a also show that dia- 
betes patients have higher expression in the upper two 
third of the genes but lower expression in the lower 
one third. Another interesting gene set we identified is 
MAP00531_Glycosaminoglycan_degradation (Figure 5b), 
which was found statistically significant using TEGS 
with different working correlations: p- values are 0.00034 
(adaptive-factor covariance), 0.0032 (unstructured), 0.021 
(independence) and 0.022 (exchangeable), but was not 
significant using GSEA: p-value=0.39. 

Discussion 

The power of TEGS is improved by accounting for the cor- 
relation among genes within the gene set, especially when 
the working covariance gets closer to the true covariance, 
and outperforms the TEGS assuming working indepen- 
dence. We have also shown that the TEGS with working 
independence among genes in a gene set corresponds to 
the global test proposed by Goeman, et al (2004). Our 
numerical studies show that the TEGS assuming two fac- 
tors or adaptive factor covariance matrix overall works 
well in practice for difference true covariance structures, 
especially when the number of genes p is larger than 
sample size n. 

We have compared the performance of TEGS with 
GSEA. Both tests borrow information across multiple 
genes in a gene set and are hence beneficial when a gene 
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Figure 5 Gene expression values in two gene sets, (a) MAP00252_Alanine_and_aspartate_metabolism and (b) 
MAP00531_Glycosaminoglycan_degradation.The main plot is the expression values of genes standardized by the means and standard 
deviations. Each column represents a patient labeled by either "N" (normal glucose tolerance) or "D" (Type II Diabetes). The right panel represents 
the p values from t-test of single gene analysis where the dashed line indicates p=0.05 and the dotted line indicates p=0.2. 



set has multiple signal genes with modest effects. TEGS 
and GSEA differ in several aspects. The TEGS statis- 
tic is constructed by accounting for correlation among 
genes in a gene set, while GSEA uses individual gene p- 
values to calculate the Kolmogorov-Smirnov test, which 
ignores the within gene set correlation. TEGS considers 
the self-contained null hypothesis, while GSEA consid- 
ers the competitive null hypothesis [10]. GSEA studies the 
enrichment of genes in a gene set by testing the relative 



rankings of the genes in a gene set among all the genes 
under investigation. GSEA hence is influenced by the size 
of a gene set, the proportion of signal genes in the gene 
set, and the proportion of signal genes not in the gene set. 
When the proportion of signal genes in a gene set is much 
larger relative to that not in the gene set, GSEA performs 
well. We note that GSEA has difficulties in capturing a dif- 
ferentially expressed gene set when the number of genes 
containing true effects is small in a gene set even if the 
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effects of these signal genes are strong. When the size of 
a gene set is small/modest, the power of GSEA does not 
increase much when the number of null genes not in the 
gene set increases or when the sample size increases. As 
TEGS considers a self-contained null hypothesis, i.e., test- 
ing whether a gene set is differentially expressed, it is not 
affected by the behavior of the genes not in the gene set. 
TEGS improves power when sample size increases or the 
magnitudes of signal genes in a gene set increase. How- 
ever, TEGS does not directly compare a gene set with 
other gene sets, although one can rank gene sets using p- 
values calculated using TEGS. Our numerical results show 
that TEGS outperforms GSEA in terms of size and power, 
although the powers are not directly comparable as they 
test for different null hypotheses. 

Due to the nature of the null hypothesis we specified, it 
is possible that a significant gene set from our proposed 
testing procedure is driven by one or two very significant 
genes, which is less likely to occur in GSEA. There are sev- 
eral possible ways to guard against this. For example, after 
a gene set is identified to be significant, one can perform 
single gene analyses to further characterize how the sig- 
nals are distributed within the gene set. Or, one can use 
the same multivariate model as (1) to estimate and test 
the association of each gene with the phenotype using the 
ridge regression. 

TEGS is not limited to testing mRNA expression in 
biological pathways/networks. It can also be applied for 
testing the effects of other genomic markers, such as DNA 
copy numbers, RNA or protein expressions, and DNA 
methylations in a genomic region or a functional set. 

Conclusions 

We have proposed in this paper a new method for the gene 
set analysis, TEGS. By introducing a working covariance, 
TEGS directly models the interdependence of the expres- 
sion values in a gene set, the most important feature of 
biological pathways or gene sets that is often overlooked 
in existing methods. TEGS incorporates information from 
multiple genes in a gene set through the working covari- 
ance and thus outperforms two widely used approach, 
GSEA and global test in simulation studies and a diabetes 
microarray data. 
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